Abstract: We propose a sparse coefficient estimation and automated model selection procedure
for autoregressive (AR) processes with heavy-tailed innovations based on penalized
conditional maximum likelihood. Under mild moment conditions on the innovation
processes, the penalized conditional maximum likelihood estimator (PCMLE) satisfies
strong consistency, $O_p(N^{-1/2})$ consistency, and the oracle properties, where $N$ is the
sample size. Because only weak conditions are imposed on the penalty functions, we are
free to choose among them. Two penalty functions, least absolute shrinkage and selection operator (LASSO)
and smoothly clipped absolute deviation (SCAD), are compared. The proposed method
provides a distribution-based penalized inference to AR models, which is especially useful 
when other estimation methods fail or underperform for AR processes with heavy-tailed
innovations [13]. A simulation study confirms our theoretical results. Finally,
we apply our method to historical data on the US Industrial Production Index for
consumer goods and obtain very promising results.

1. Introduction

The autoregressive AR(p) process is one of the most fundamental time series models and has
been extensively studied and applied in different fields. One major role that AR models play in
the analysis of time series is the autoregressive representation of a stationary time series.
While theoretically such a representation "will give answers to many problems" ([1]), in practice
any AR process, being an approximation to what is observed in reality, must allow for
an arbitrarily large order p in order to achieve a satisfactory approximation (see e.g.
[24]). The autoregressive moving average (ARMA) process is one such stationary time series
that can be represented by an infinite-order AR process. Inference for ARMA models is
usually made by fitting a long-order AR model to the data, which is viewed as a truncation of
the AR($\infty$) representation. See [27], [15], [16], [20], among others. Moreover, the need to model
long-range dependence in economic and financial data analysis also calls for the application of
long-order AR processes. For instance, the autoregression-based approximation to the
autoregressive fractionally integrated moving average (ARFIMA) processes is considered an efficient
and desirable method for making inferences about long-memory ARFIMA models. See [16] and
[24] for a partial list of references. Nevertheless, traditional model selection procedures based on
criteria such as FPE [3], AIC [2] and BIC [26] are not efficient in fitting long-order AR processes,
especially when the AR process has a sparse structure.

In this paper, we propose an automated and efficient model selection procedure which is based 
on penalized conditional maximum likelihood for AR processes. The shrinkage estimators have 
a long history. See [28], [25] and [5], for examples. Technically, such estimators could obtain 
the shrinkage feature via the minimization of a loss function plus a penalty term, with the loss 
function being the least squares or the negative log likelihood in the usual cases. The existence 
of a suitably chosen penalty induces zero elements in the estimates, resulting in a simultaneous 
model selection procedure, while the parameters are being estimated. In the past two decades, a 
great deal of literature has been devoted to investigating such techniques, and a large number of 
penalty functions have been proposed including LASSO [29], SCAD [10], adaptive LASSO [33]. 
See [11], [12], [19], [32], [21], and [31] for a partial list of references. Although these techniques 
have been thoroughly studied and widely applied in the independent data settings, their perfor- 
mances in the time series context have not been studied very much. In this paper, we propose a 
penalized sparse estimation for AR(p) models and thus develop a new model selection procedure.
Based on conditional likelihood, our PCMLE is especially useful when the time series model has 
heavy-tailed innovations. This is striking since the regular methods fail or underperform for AR
processes with heavy-tailed innovations [13]. Asymptotic properties of our PCMLE, regarding 
both estimation accuracy and model selection consistency, are investigated under the general 
conditional likelihood framework and mild conditions for the innovations and the penalty func- 
tions. 

Our theoretical results are two-fold. First, we give strong consistency of the PCMLE in The- 
orem 1 under weak conditions on the innovations and the penalty functions. In particular, we 
only require the sequence of penalty functions to be uniformly equicontinuous and converging 
to zero. The conditional maximum likelihood estimators with either LASSO or SCAD penalties 
enjoy the strong consistency. Second, we show that under certain regularity conditions on the 
innovations and the penalty functions, including the existence of the fourth moment of the
innovations, the PCMLE of the coefficients is $N^{-1/2}$-consistent in probability. Furthermore, in
Theorem 2 we derive what are known in the literature as the "oracle properties" for this
$N^{-1/2}$-consistent PCMLE: 1) The coefficients whose true values are zero are estimated to be
exactly zero with probability going to one. This property, referred to as sparsity, guarantees that 
the optimal model will be chosen with probability going to one. 2) The PCMLEs for the non- 
zero coefficients satisfy a multivariate central limit theorem, which states that asymptotically 
the estimated non-zero coefficients obtain the same efficiency as if the true sparse structure were 
known in advance. This immediately relaxes the constraint on the magnitude of the order p, as 
enlarging p will no longer bring in proportionally more burden on the estimation efficiency. The 
PCMLE with the SCAD penalty has the oracle properties, whereas our results do not establish them for the LASSO penalty. All these
properties are confirmed by simulations with Gaussian and non-Gaussian innovations. Finally, 
we give a detailed discussion and rule of thumb on how the sample size should be adjusted in
order to minimize small-sample risk and achieve optimal performance.

In this paper, we shall use the following conventions: the notation $\|\cdot\|$ is used for the $L_2$
norm; the notation $\Rightarrow$ denotes weak convergence; $X_N = o_p(1)$ is used for convergence to
zero in probability; and boldface letters denote vectors. Besides, we denote by $X_N = O_p(1)$
a sequence of random variables bounded in probability (see e.g. Definition 3.3, [30]).

Throughout the paper, we assume the order p is fixed, and does not increase with sample size N. 

The rest of the paper is organized as follows: Section 2 formally introduces our methodology 
and results. We discuss the performances of the PCMLE with two popular penalties, LASSO 
and SCAD, in Section 3. Simulation results are reported in Section 4, which include simulations 
with both Gaussian and non-Gaussian innovations. We demonstrate our method with a real data 
analysis in Section 5, which shows improved performances over the traditional MLE and FPE 
based model selection. We finish with a conclusion in Section 6. Proofs of our results are collected 
in Section 7. Useful lemmas and their proofs are deferred to the Appendix. 

2. Main results 

In this paper we study the PCMLE of the AR(p) model

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t. \qquad (2.1)$$

Let $\Theta$ be the space of parameter vectors $\theta = (\phi_1, \dots, \phi_p)^T$, $\theta_0 = (\phi_{1,0}, \dots, \phi_{p,0})^T$ be the
underlying parameter vector, and $\sigma(X_{t-1}, \dots, X_{t-p})$ be the $\sigma$-algebra generated by the random
variables $X_{t-1}, \dots, X_{t-p}$. Denote by $f_t(x) := f(x \mid \sigma(X_{t-1}, \dots, X_{t-p}); \theta)$ the conditional
density function of $X_t$ given $X_{t-1}, \dots, X_{t-p}$. Given observations $X_1, \dots, X_N$, the conditional log
likelihood function $L(\theta)$ is

$$L(\theta) := L(X_1, \dots, X_N \mid \theta) := \log \prod_{t=p+1}^{N} f_t(X_t) = \sum_{t=p+1}^{N} \log f_t(X_t) := \sum_{t=p+1}^{N} l_t(\theta).$$

Here we take the convention $\log 0 = 0$. As in the literature, the PCMLE of $\theta$ is defined as

$$\hat\theta := \hat\theta_{\lambda_N} := \arg\max_{\theta \in \Theta} \{L(\theta) - N P_{\lambda_N}(\theta)\}, \qquad (2.2)$$

where $P_{\lambda_N}(\theta)$ is a penalty function and $\lambda_N$ is a tuning parameter. Further, denote

$$Q(\theta^T) := Q(\theta) := L(\theta) - N P_{\lambda_N}(\theta).$$

We will make the following assumptions for all the results in this section. 

Assumptions 1. 1. The innovations $Z, \{Z_t\}$ are independent and identically distributed
(i.i.d.) random variables with zero mean and finite variance.
2. $\Phi(z) := 1 - \phi_{1,0} z - \cdots - \phi_{p,0} z^p \neq 0$ for all $z \in \mathbb{C}$ such that $|z| \le 1$.

Under the conditions from the first part, the second part of Assumptions 1 is equivalent
to the causality of the AR(p) time series, i.e., there exists a sequence of constants $\{a_i\}$ such
that $\sum_{i=0}^{\infty} |a_i| < \infty$ and $X_t = \sum_{i=0}^{\infty} a_i Z_{t-i}$ (Theorem 3.1.1, [6]). It is clear that this time
series is weakly and strictly stationary with $E X_t = 0$. Denote the autocovariance function by
$\gamma(h) = \mathrm{Cov}(X_t, X_{t+h}) = E X_t X_{t+h}$.
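
The causal representation can be explored numerically. The following is a minimal sketch, not part of the paper's method: it computes the coefficients $a_i$ from the recursion $a_0 = 1$, $a_i = \sum_{j=1}^{\min(i,p)} \phi_j a_{i-j}$, which follows from matching powers of $z$ in $\Phi(z) \sum_i a_i z^i = 1$. The function name `ma_coefficients` is hypothetical, and the example coefficients are those of model (4.1) in Section 4.

```python
def ma_coefficients(phi, n_terms):
    """Return a_0, ..., a_{n_terms-1} of the expansion X_t = sum_i a_i Z_{t-i}."""
    p = len(phi)
    a = [1.0]  # a_0 = 1
    for i in range(1, n_terms):
        # a_i = phi_1 * a_{i-1} + ... + phi_p * a_{i-p}, truncated at a_0
        a.append(sum(phi[j - 1] * a[i - j] for j in range(1, min(i, p) + 1)))
    return a


# Coefficients of model (4.1): lags 1, 3 and 5 equal 0.2, lags 2 and 4 are zero.
phi_41 = [0.2, 0.0, 0.2, 0.0, 0.2]
a_41 = ma_coefficients(phi_41, 500)
abs_sum_41 = sum(abs(c) for c in a_41)  # partial sum of |a_i|
```

Because all coefficients in this example are non-negative, the partial sums of $|a_i|$ approach $1/\Phi(1) = 1/(1 - 0.6) = 2.5$, which is consistent with absolute summability.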

Let $g(z)$ be the density function of $Z$. Observe that $f_t(x) = g(x - \sum_{j=1}^{p} \phi_j X_{t-j})$. Therefore,

$$f_t(X_t) = g\Big(X_t - \sum_{j=1}^{p} \phi_j X_{t-j}\Big) \quad \text{and} \quad l_t(\theta) = \log g\Big(X_t - \sum_{j=1}^{p} \phi_j X_{t-j}\Big). \qquad (2.3)$$

In particular,

$$l_t(\theta_0) = \log g\Big(X_t - \sum_{j=1}^{p} \phi_{j,0} X_{t-j}\Big) = \log g(Z_t). \qquad (2.4)$$
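
To make the objective concrete, the following sketch evaluates the conditional log likelihood and the penalized objective $L(\theta) - N P_{\lambda_N}(\theta)$ for a given coefficient vector. It is an illustration under stated assumptions: the innovation density is taken to be standard normal, the penalty is the LASSO penalty, and all function names are hypothetical.

```python
import math


def residuals(x, phi):
    """X_t - sum_j phi_j X_{t-j} for t = p+1, ..., N (0-based indices p..N-1)."""
    p = len(phi)
    return [x[t] - sum(phi[j] * x[t - 1 - j] for j in range(p))
            for t in range(p, len(x))]


def gaussian_log_density(z):
    """log g(z) for the standard normal density (an assumed choice of g)."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def cond_loglik(x, phi, log_g=gaussian_log_density):
    """Conditional log likelihood: the sum of l_t(theta) = log g(residual_t)."""
    return sum(log_g(e) for e in residuals(x, phi))


def penalized_objective(x, phi, lam, log_g=gaussian_log_density):
    """L(theta) - N * sum_i lam * |phi_i|, i.e. the objective with a LASSO penalty."""
    n = len(x)
    return cond_loglik(x, phi, log_g) - n * sum(lam * abs(c) for c in phi)
```

For a heavy-tailed model one would pass a different `log_g`, for example the log density of a t distribution; the rest of the code is unchanged.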



The next theorem gives the conditions such that the PCMLE has strong consistency. 

Theorem 1. Assume that the parameter vector space $\Theta$ is compact, $g(z)$ is continuous and
$E|\log g(Z)| < \infty$. Further, we assume that $\{P_{\lambda_N}(\theta)\}$ are uniformly equicontinuous in $\Theta$ and
$P_{\lambda_N}(\theta) \to 0$ as $N \to \infty$ for each $\theta \in \Theta$. Then under Assumptions 1, $\hat\theta_{\lambda_N}$ converges to $\theta_0$ almost
surely.

Usually the vector $\theta_0 = (\phi_{1,0}, \dots, \phi_{p,0})^T$ has some zero components. Without loss of generality,
we assume that $\theta_0$ has $s$ zeros and that these zeros are the first $s$ parameters. Then we write

$$\theta_0^T = (\phi_{1,0}, \dots, \phi_{p,0}) = (0, \dots, 0, \phi_{s+1,0}, \dots, \phi_{p,0}) := (0^T, \theta_{0,1}^T) = (\theta_{0,0}^T, \theta_{0,1}^T).$$

With the same rearrangement, $\theta^T = (\phi_1, \dots, \phi_p) := (\theta_{1,0}^T, \theta_{1,1}^T)$.

The results in the rest of this section need the following extra assumptions. 

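Assumptions 2. 1. $Z$ has a finite fourth moment.
2. $E\,\dfrac{(g'(Z))^4}{g^4(Z)} < \infty$ and $E\,\dfrac{(g''(Z))^2}{g^2(Z)} < \infty$.
3. $\big|(g'/g)''(z)\big| \le B$ uniformly in $z$ for some constant $B$.

Besides, $E\,\dfrac{(g'(Z))^2}{g^2(Z)} < \infty$ if $E\,\dfrac{(g'(Z))^4}{g^4(Z)} < \infty$. We denote

$$C(g) := E\,\frac{(g'(Z))^2}{g^2(Z)}. \qquad (2.5)$$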



Example 1. In the important case $Z \sim N(0,1)$, $g'(z) = -z g(z)$ and $g''(z) = (z^2 - 1) g(z)$.
Therefore, $C(g) = E\,\frac{(g'(Z))^2}{g^2(Z)} = E Z^2 = 1 < \infty$, $E\,\frac{(g''(Z))^2}{g^2(Z)} = E(Z^2 - 1)^2 < \infty$, $(g'/g)''(z) = 0$, and
$E\,\frac{(g'(Z))^4}{g^4(Z)} = E Z^4 < \infty$. We only require the existence of the fourth moment of the innovation
in the assumptions. Therefore our results also apply to AR processes with heavy tails. For
example, the t distributions with degrees of freedom df > 4 satisfy all the conditions in Assumptions
2. The algebra is tedious but routine.
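
The constant $C(g) = E[(g'(Z)/g(Z))^2]$ can also be checked numerically. The sketch below is an illustration rather than part of the paper: it approximates the expectation by a midpoint rule on a truncated grid, for the standard normal density and for a t density with 5 degrees of freedom. The function names, the grid width and the step size are assumptions chosen for this example.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_score(z):
    # g'(z) / g(z) for the standard normal density
    return -z


def t_pdf(z, df):
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + z * z / df) ** (-(df + 1) / 2)


def t_score(z, df):
    # g'(z) / g(z) for the t density, from differentiating log g
    return -(df + 1) * z / (df + z * z)


def c_of_g(pdf, score, half_width=60.0, step=0.002):
    """Midpoint-rule approximation of E[(g'(Z)/g(Z))^2] on [-half_width, half_width]."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n):
        z = -half_width + (i + 0.5) * step
        total += score(z) ** 2 * pdf(z) * step
    return total


c_normal = c_of_g(normal_pdf, normal_score)
c_t5 = c_of_g(lambda z: t_pdf(z, 5), lambda z: t_score(z, 5))
```

For the normal case the result should be close to 1, in agreement with Example 1. For the t density with df degrees of freedom the value is the location Fisher information, (df + 1)/(df + 3), so 0.75 for df = 5.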

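In the following propositions and theorem, the penalty function $P_{\lambda_N}(\theta)$ has the form

$$P_{\lambda_N}(\theta) = \sum_{i=1}^{p} p_{\lambda_N}(|\phi_i|).$$

Assumptions 3. The assumptions on the penalty $p_{\lambda_N}(|\phi|)$ are

1. $\lambda_N \to 0$, $\sqrt{N}\lambda_N \to \infty$ as $N \to \infty$ and $\liminf_{N\to\infty} \liminf_{\phi\to 0+} p'_{\lambda_N}(\phi)/\lambda_N > 0$;

2. $p_{\lambda_N}(\phi) \ge 0$, $p_{\lambda_N}(0) = 0$, $a_N = \max\{|p'_{\lambda_N}(|\phi_{i,0}|)| : \phi_{i,0} \neq 0\} \to 0$, $\max\{|p''_{\lambda_N}(|\phi_{i,0}|)| : \phi_{i,0} \neq 0\} \to 0$
as $N \to \infty$ and $p'''_{\lambda_N}$ exists and is bounded.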

Proposition 1. Assume Assumptions 1, 2 and part 1 of Assumptions 3. With probability tending
to 1, for any given $\theta_{1,1}$ with $\|\theta_{1,1} - \theta_{0,1}\| = O_p(N^{-1/2})$, we have

$$Q((0^T, \theta_{1,1}^T)^T) = \max_{\|\theta_{1,0}\| \le C N^{-1/2}} Q(\theta)$$

for some constant $C$.

This proposition gives the sparsity of the $N^{-1/2}$-consistent estimator, i.e., the coefficients
whose true values are zero are estimated to be exactly zero with probability tending to 1. The
next proposition is useful for obtaining the $N^{-1/2}$-consistent PCMLE.

Proposition 2. Assume Assumptions 1, 2 and part 2 of Assumptions 3. Then there exists a
local maximizer $\hat\theta$ of $Q(\theta)$ such that $\|\hat\theta - \theta_0\| = O_p(N^{-1/2} + a_N)$, with probability going to one,
where $a_N$ is defined as in Assumptions 3.

Under Assumptions 3, if the quantity $a_N$ defined in Assumptions 3 satisfies $a_N = O(N^{-1/2})$,
the local maximizer of $Q(\theta)$ in Proposition 2 is an $N^{-1/2}$-consistent PCMLE of $\theta$. Therefore,
from Proposition 1, this estimator has sparsity, i.e., with probability tending to 1 the estimates
of the zero coefficients are zeros, $\hat\theta_{1,0} = 0$. We list this conclusion as the first part of the following
theorem. Further, we show that the estimates of the non-zero coefficients satisfy an asymptotic
normality in the second part of this theorem.

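Theorem 2. Let $\Gamma$ be the non-negative definite $(p - s) \times (p - s)$ matrix with the entry $\Gamma(l, m) =
\gamma(m - l)$, $1 \le l, m \le p - s$. Denote

$$A = \mathrm{diag}\{p''_{\lambda_N}(|\phi_{s+1,0}|), \dots, p''_{\lambda_N}(|\phi_{p,0}|)\},$$

$$b = \big(p'_{\lambda_N}(|\phi_{s+1,0}|)\,\mathrm{sgn}(\phi_{s+1,0}), \dots, p'_{\lambda_N}(|\phi_{p,0}|)\,\mathrm{sgn}(\phi_{p,0})\big)^T.$$

Assume $a_N = O(N^{-1/2})$. Under Assumptions 1, 2 and 3, the local maximizer $\hat\theta = (\hat\theta_{1,0}^T, \hat\theta_{1,1}^T)^T$
of $Q(\theta)$ satisfies

1. $\hat\theta_{1,0} = 0$,

2. $\sqrt{N}\,[(C(g)\Gamma + A)(\hat\theta_{1,1} - \theta_{0,1}) + b] \Rightarrow N(0, C(g)\Gamma)$.

Here the constant $C(g)$ is given by (2.5).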



3. Discussion 



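In this section we discuss two popular penalties, SCAD ([10]) and LASSO ([29]). The Smoothly
Clipped Absolute Deviation (SCAD) penalty is defined by its first derivative as follows:

$$p'_{\lambda_N}(|\phi|) = \lambda_N I(|\phi| \le \lambda_N) + \frac{(a\lambda_N - |\phi|)_+}{a - 1} I(|\phi| > \lambda_N), \qquad (3.1)$$

where $a > 2$ is the second tuning parameter. More precisely,

$$p_{\lambda_N}(|\phi|) = \lambda_N |\phi| I(|\phi| \le \lambda_N) + \Big(\frac{a\lambda_N |\phi|}{a - 1} - \frac{\phi^2}{2(a - 1)} - \frac{\lambda_N^2}{2(a - 1)}\Big) I(\lambda_N < |\phi| \le a\lambda_N) + \frac{(a + 1)\lambda_N^2}{2} I(|\phi| > a\lambda_N). \qquad (3.2)$$

Further,

$$p''_{\lambda_N}(|\phi|) = -(a - 1)^{-1} I(\lambda_N < |\phi| < a\lambda_N). \qquad (3.3)$$

The Least Absolute Shrinkage and Selection Operator (LASSO) penalty is defined as the absolute value
of the parameter with a scaling parameter $\lambda_N$. That is, $p_{\lambda_N}(|\phi|) = \lambda_N |\phi|$. For both the LASSO and
SCAD penalties, $\lambda_N > 0$.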
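
The two penalties are simple to evaluate. The following sketch implements the standard SCAD penalty of [10], its first derivative, and the LASSO penalty for a single coefficient. The function names are hypothetical, and the numerical values in the usage lines are arbitrary illustrations.

```python
def lasso_penalty(phi, lam):
    """LASSO penalty lam * |phi|."""
    return lam * abs(phi)


def scad_derivative(phi, lam, a):
    """First derivative of the SCAD penalty at |phi| (a > 2)."""
    x = abs(phi)
    if x <= lam:
        return lam
    return max(a * lam - x, 0.0) / (a - 1.0)


def scad_penalty(phi, lam, a):
    """SCAD penalty at |phi|: linear, then quadratic, then constant."""
    x = abs(phi)
    if x <= lam:
        return lam * x
    if x <= a * lam:
        return (a * lam * x - 0.5 * x * x - 0.5 * lam * lam) / (a - 1.0)
    return 0.5 * (a + 1.0) * lam * lam


# Arbitrary illustration with lam = 1 and a = 3.7.
small = scad_penalty(0.5, 1.0, 3.7)   # linear region
middle = scad_penalty(2.0, 1.0, 3.7)  # quadratic region
large = scad_penalty(5.0, 1.0, 3.7)   # constant region
```

The penalty is continuous at both break points $\lambda_N$ and $a\lambda_N$, and its derivative vanishes beyond $a\lambda_N$, which is why large coefficients are not shrunk.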

To have the strong consistency as in Theorem 1, we require that $\{P_{\lambda_N}(\theta)\}$ be uniformly
equicontinuous and $P_{\lambda_N}(\theta) \to 0$ as $N \to \infty$ for each $\theta \in \Theta$. This condition is satisfied simply by
setting $\lambda_N \to 0$ for both SCAD and LASSO penalties. Therefore, the PCMLE with either the LASSO
or the SCAD penalty enjoys strong consistency.

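Recall the definition of $a_N$ in Assumptions 3. For LASSO, $a_N = \lambda_N$. To have Assumptions 3,
we need $N^{1/2} a_N \to \infty$. This contradicts $a_N = O(N^{-1/2})$. Therefore, from what we
have proved, there is not enough evidence to claim that the PCMLE (2.2) with the LASSO penalty
has the oracle properties. For SCAD, it is easy to verify that $p_{\lambda_N}(|\phi|) \ge 0$ and $p_{\lambda_N}(0) = 0$ by (3.2).
From (3.1), we have

$$\liminf_{N\to\infty} \liminf_{\phi\to 0+} p'_{\lambda_N}(|\phi|)/\lambda_N = \liminf_{N\to\infty} 1 = 1 > 0.$$

From (3.3),

$$\max\{|p''_{\lambda_N}(|\phi_{i,0}|)| : \phi_{i,0} \neq 0\} = (a - 1)^{-1} I(\lambda_N < |\phi_{i,0}| < a\lambda_N \text{ for some } i) \qquad (3.4)$$

and $p'''_{\lambda_N} = 0$. Besides,

$$a_N = \max\{|p'_{\lambda_N}(|\phi_{i,0}|)| : \phi_{i,0} \neq 0\} = \max\Big\{\lambda_N I\big(\min_{\phi_{i,0} \neq 0} |\phi_{i,0}| \le \lambda_N\big),\ \max_{\phi_{i,0} \neq 0} \frac{a\lambda_N - |\phi_{i,0}|}{a - 1} I(\lambda_N < |\phi_{i,0}| < a\lambda_N)\Big\}.$$

Therefore, $a_N = 0$ if $\min_{\phi_{i,0} \neq 0} |\phi_{i,0}| \ge a\lambda_N$. Otherwise, $a_N = O(\lambda_N)$. Hence, for the sequence
$\{\lambda_N\}$ with $\lambda_N \to 0$ and $N^{1/2}\lambda_N \to \infty$, (3.4) $= a_N = 0$ if $N$ satisfies $\min_{\phi_{i,0} \neq 0} |\phi_{i,0}| \ge a\lambda_N$. But
$\min_{\phi_{i,0} \neq 0} |\phi_{i,0}| \ge a\lambda_N$ is true eventually if $\lambda_N \to 0$. So the PCMLE with the SCAD penalty has the
oracle properties if $\lambda_N \to 0$ and $N^{1/2}\lambda_N \to \infty$. In practice, it is recommended to choose the sample
size $N$ with $\min_{\phi_{i,0} \neq 0} |\phi_{i,0}| \ge a\lambda_N$ after the sequence $\{\lambda_N\}$ is selected, if one has information
on $\min_{\phi_{i,0} \neq 0} |\phi_{i,0}|$, which can be routinely obtained by traditional estimation methods such as least squares
or MLE. $\lambda_N$ should be selected with $N^{1/2}\lambda_N \to \infty$ but can be close to $N^{-1/2}$. If so, the sample size
$N$ should be chosen with $N^{-1/2} = o(\min_{\phi_{i,0} \neq 0} |\phi_{i,0}|)$ but possibly close to $(\min_{\phi_{i,0} \neq 0} |\phi_{i,0}|)^{-2}$.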
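
The rule of thumb for SCAD can be sketched in code. The example below is a hedged illustration only: it computes $a_N$ as the largest SCAD derivative over the non-zero true coefficients and searches for the smallest $N$ at which the smallest non-zero coefficient is at least $a\lambda_N$. The schedule $\lambda_N = N^{-0.4}$ and the values of the coefficients and of $a$ are hypothetical choices, not taken from the paper; the schedule is picked only because it tends to zero while $N^{1/2}\lambda_N$ diverges.

```python
def scad_derivative(x, lam, a):
    """First derivative of the SCAD penalty at x >= 0."""
    if x <= lam:
        return lam
    return max(a * lam - x, 0.0) / (a - 1.0)


def scad_a_n(phi0, lam, a):
    """a_N: the largest SCAD derivative over the non-zero true coefficients."""
    nonzero = [abs(c) for c in phi0 if c != 0.0]
    return max(scad_derivative(x, lam, a) for x in nonzero)


def example_lambda(n):
    # Hypothetical schedule: N^(-0.4) -> 0 while sqrt(N) * N^(-0.4) = N^0.1 -> infinity.
    return n ** -0.4


def smallest_sample_size(min_phi, a, lam_of_n, n_max=10**6):
    """Smallest N <= n_max with min_phi >= a * lambda_N, or None if there is none."""
    for n in range(1, n_max + 1):
        if min_phi >= a * lam_of_n(n):
            return n
    return None


phi0 = [0.2, 0.0, 0.2, 0.0, 0.2]  # illustrative true coefficients
n_star = smallest_sample_size(0.2, 3.7, example_lambda)
a_n_at_n_star = scad_a_n(phi0, example_lambda(n_star), 3.7)
```

Under these assumptions `a_n_at_n_star` is exactly zero, so the SCAD penalty no longer biases the non-zero coefficients, whereas for a LASSO penalty the corresponding quantity would equal $\lambda_N$ at every sample size.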

4. Simulation study 

In this section we look at the performances of the two penalties, SCAD and LASSO, by numerical
experiments. The simulations are two-fold. On one hand, we simulate data from AR(p) models
which contain only zero and "large" non-zero parameters. The non-zero parameters are "large" in
the sense that they are well above the order of $O(N^{-1/2})$, and therefore have very little risk of being
mistakenly shrunk to 0 by the penalty. The performances of MLE and PCMLE are compared. On
the other hand, we also consider the cases when some "small" non-zero parameters are involved
in the model. That is, some non-zero parameters are smaller than $O(N^{-1/2})$. Just as we expected,
the numerical results show no statistical difference between the zero parameters and the "small"
non-zeros.

We get a preliminary estimate of the coefficients by the usual MLE, which is next used as the 
initial value for the PCMLE algorithm. In the literature, there are two algorithms to compute 
the PCMLE, both of which are based on polynomial approximations of the penalty functions and 
eventually lead to a modified Newton-Raphson algorithm. The earlier one is the Local Quadratic 
Approximation (LQA) method proposed in [10]. This algorithm essentially iteratively uses the 
Ridge penalty which does not produce zero estimates [14]. In practice the zeros are picked out 
heuristically rather than by the algorithm. This is a drawback since a parameter stays at zero 
after it is determined to be zero at some iteration. A later improvement for the LQA is the Local 
Linear Approximation (LLA) method proposed in [34], which iteratively computes the PCMLE 
with sparsity. Furthermore, the employment of the LLA offers the convenience to take advantage
of many standard LASSO algorithms, by which LLA is computationally much more efficient than
LQA. We therefore choose LLA to compute our PCMLE, specifically, the one-step LLA sparse 
estimator proposed in [34]. The first 80 percent of the sample is used to compute the PCMLE of 
the coefficients, and the tuning parameters are chosen by maximizing the unpenalized likelihood 
on the remaining 20 percent of the sample. 
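
The one-step LLA idea can be sketched as follows. This is a simplified illustration, not the implementation used in the paper: it assumes Gaussian innovations with unit variance, so that maximizing the conditional likelihood reduces to least squares, and it solves the resulting weighted LASSO problem by plain coordinate descent. Each weight is $N p'_{\lambda}(|\phi_j^{(0)}|)$, where $\phi^{(0)}$ is the initial (for example MLE) estimate. All function names are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def scad_derivative(x, lam, a):
    if x <= lam:
        return lam
    return max(a * lam - x, 0.0) / (a - 1.0)


def weighted_lasso(X, y, weights, n_sweeps=200):
    """Minimize 0.5 * ||y - X b||^2 + sum_j weights[j] * |b_j| by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual that excludes b_j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, weights[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


def one_step_lla(X, y, beta_init, lam, a):
    """One LLA step: weighted LASSO with weights N * p'_lam(|beta_init_j|)."""
    n = len(y)
    weights = [n * scad_derivative(abs(b), lam, a) for b in beta_init]
    return weighted_lasso(X, y, weights)
```

For an AR(p) fit, row $t$ of `X` would hold the lagged values $(X_{t-1}, \dots, X_{t-p})$ and `y` the corresponding $X_t$. Coefficients whose initial estimates exceed $a\lambda$ receive zero weight and are left unpenalized.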

We first simulate the following AR(5) model with sample size N = 1000:

$$X_t = 0.2X_{t-1} + 0.2X_{t-3} + 0.2X_{t-5} + Z_t. \qquad (4.1)$$

Here, the innovations $\{Z_t\}$ are generated independently from the standard normal distribution.
Notice that 0.2 is well above the threshold $O(N^{-1/2})$. It is also easy to verify that such a
combination of coefficients satisfies part 2 of Assumptions 1, the causality condition. Table 1 reports
detailed results comparing the performances of the MLE, LASSO PCMLE, and SCAD PCMLE. The
error refers to the L2 norm of the difference between the estimated coefficients and their true
values. The std refers to the standard error calculated by the sandwich formula [10]. It is clear
that the SCAD PCMLE detects both zero coefficients, whereas the LASSO PCMLE fails to identify one
zero coefficient. In addition, the SCAD PCMLE has smaller estimation errors and standard errors.
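
A sparse AR process of this kind can be simulated with a short routine. The sketch below is illustrative: the burn-in length of 500, the seed and the function name `simulate_ar` are assumptions made here, not details given in the paper.

```python
import random


def simulate_ar(phi, n, innovation, burn_in=500):
    """Simulate n observations of X_t = sum_j phi_j X_{t-j} + Z_t after a burn-in."""
    p = len(phi)
    x = [0.0] * p  # zero start-up values, discarded with the burn-in
    for _ in range(burn_in + n):
        z = innovation()
        # x[-1 - j] is X_{t-1-j} because the new value has not been appended yet.
        x.append(sum(phi[j] * x[-1 - j] for j in range(p)) + z)
    return x[-n:]


rng = random.Random(0)  # arbitrary seed for reproducibility
phi_41 = [0.2, 0.0, 0.2, 0.0, 0.2]  # lags 1, 3, 5 as in the model above
series = simulate_ar(phi_41, 1000, lambda: rng.gauss(0.0, 1.0))
```

Passing a different `innovation` callable, for instance one that draws from a t distribution, gives the heavy-tailed variants without changing the recursion.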

Table 1

Comparison of MLE, LASSO PCMLE, and SCAD PCMLE for model (4.1).

Lag     MLE      std      LASSO    std      SCAD     std      True
1       0.2067   0.0179   0.1947   0.0191   0.2015   0.0173   0.2
2       -0.008   0.0179   0        0        0        0        0
3       0.2191   0.0172   0.207    0.0183   0.2139   0.0166   0.2
4       -0.018   0.0182   -0.001   0.001    0        0        0
5       0.1757   0.0181   0.1637   0.0193   0.1705   0.0171   0.2
error   0.0373            0.0373            0.0326
λ                         0.02              0.08
a                                           2.1


We repeat the above process with N = 1000 for 100 times independently, and the results are
summarized in Table 2. The probability of being identified as 0 is calculated as the sample proportion
over the 100 trials for each coefficient. The average bias is the absolute difference between the mean
value of the 100 estimates and the corresponding true value. The LASSO PCMLE is relatively
conservative in terms of sparsity: in only approximately 1/3 of the 100 trials does
the LASSO PCMLE correctly identify each of the zero coefficients. In comparison, this
proportion increases to approximately 3/4 for the SCAD PCMLE. In particular, the sample probabilities
of correctly getting 2 zeros for the LASSO and SCAD PCMLE are 0.2 and 0.61 respectively. The
observed biases of the SCAD PCMLE are also smaller than those of the LASSO PCMLE at every lag
except lag 5. We calculate
the sample probability of getting a 0 estimate out of 100 independent trials for the two zero coefficients
$\phi_{2,0}$, $\phi_{4,0}$, individually and simultaneously, at sample sizes N = 1000, 1500, 2000, 2500, 3000, 3500,
4000, and plot it as a function of N in Figure 1. For the SCAD PCMLE, the probability for each of
the two zero coefficients increases from around 0.7 to almost 0.95 as the sample size grows from 1000
to 4000, whereas for the LASSO PCMLE this proportion mostly varies between 0.3 and 0.4 and does
not increase significantly as the sample size increases. The contrast is even sharper when looking at
the probability of getting both zeros. For the SCAD PCMLE, it increases from 0.51 (N=1000) to 0.9
(N=4000). However, for the LASSO PCMLE, it merely fluctuates around 0.2, never reaching 0.3.

Table 2

Summary of 100 independent simulations for model (4.1) with sample size 1000.

             LASSO                         SCAD
             probability of 2 zeros: 0.2   probability of 2 zeros: 0.61

Lag   True   Probability of   Average      Probability of   Average
             0 estimate       bias         0 estimate       bias
1     0.2    0.01             0.0298       0                0.0282
2     0      0.29             0.0197       0.72             0.0063
3     0.2    0                0.0309       0                0.0297
4     0      0.33             0.0206       0.79             0.0034
5     0.2    0                0.0250       0                0.0340


We further consider the following model:

Xt - 0.2Xt_i + N^^'^Xt-^ + ^^N-'^'^Xt-z + Zt. (4.2) 

That is, $\phi_{3,0}$ and $\phi_{5,0}$ now have, for fixed N, an order smaller than
$O(N^{-1/2})$. By the foregoing discussion, they may not be distinguishable from the zeros by the
PCMLE. As before, we carry out 100 independent experiments to estimate the coefficients
in model (4.2) using the LASSO and SCAD PCMLE. Consistently, there is no problem with $\phi_{1,0}$. It is
well above 0, and both the LASSO and SCAD penalties distinguish it from 0 in all 100 experiments.
Figure 2 plots the sample probability of zero estimates as a function of sample size for
the other four coefficients. Notice that, statistically, there is no longer any difference between the two
non-zero coefficients $\phi_{3,0}$, $\phi_{5,0}$ and the zero coefficients $\phi_{2,0}$, $\phi_{4,0}$ in the plots. The four plots, referring to the
four coefficients, look almost identical.

Finally, we consider models with Student's t innovations. It is easy to check that the density
of the t distribution with degrees of freedom greater than 4 satisfies all the conditions in Theorems 1
and 2. Therefore the PCMLE is expected to perform as well as it does for normal innovations.
We simulate samples of length N = 1000, with the degrees of freedom of the t distribution
df = 2, 5. The estimation results of the MLE, LASSO PCMLE and SCAD PCMLE are presented in
Table 3. The error refers to the L2 norm of the difference vector between the estimated coefficients
and their true values. For df = 2, when the conditions of Theorem 2 are not satisfied, the error of
the SCAD PCMLE is even higher than that of the MLE.

5. Application to real data 

In this section, we apply the penalized conditional likelihood method to analyze the US Industrial 
Production Index for consumer goods from January 1939 to August 2010 (www.economagic.com). 
The dataset consists of a total of 860 seasonally adjusted monthly observations. We use the first
800 observations for in-sample estimation, and the last 60 for out-of-sample forecasting.
First-order differencing is applied to the original series to remove the linear trend. We fit three AR(p)
models (p = 20, 25, 30) using both the MLE and the SCAD PCMLE. An AR(p) model with an
optimal order p = 24 chosen by the Final Prediction Error (FPE) criterion [2] is also included 
in the comparison. After the model is fitted, the differencing is inverted and all forecast values
are constructed for the original series.

[Figure 1: three panels (Lag 2 coefficient in the AR(5) model; Lag 4 coefficient in the AR(5) model; Lag 2 & 4 coefficients in the AR(5) model), each with curves for LASSO and SCAD plotted against sample size.]

Fig 1. The probability of zero estimates as a function of sample size for: 1) $\phi_{2,0}$ (upper left), 2) $\phi_{4,0}$ (upper
right), 3) $\phi_{2,0}$ and $\phi_{4,0}$ simultaneously (lower), in model (4.1).
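
As a small illustration of the differencing step used in this section, the sketch below takes first differences of a series and maps forecasts of the differenced series back to the original scale by cumulative summation from the last observed level. The function names are hypothetical, and the numbers are toy values rather than the index data.

```python
def difference(x):
    """First-order differences x_t - x_{t-1}."""
    return [x[i] - x[i - 1] for i in range(1, len(x))]


def undifference(last_level, diff_forecasts):
    """Turn forecasts of the differenced series into forecasts of the original series."""
    levels = []
    level = last_level
    for d in diff_forecasts:
        level += d  # each forecast level adds the next forecast difference
        levels.append(level)
    return levels


toy = [1, 3, 6, 10]
toy_diff = difference(toy)
toy_back = undifference(toy[0], toy_diff)
```

Applying `undifference` to the true differences recovers the original series after its first value, which is a convenient sanity check before using it on forecasts.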



We use two criteria, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), to
evaluate the forecasts. In this example, we choose forecast steps k = 1, 6, 12. Let m denote the
total number of forecasts during the period for which the actual value X(t) is known, and F(t)
denote the forecast value. Then, as in the literature, the MAE and RMSE are defined as:

$$\mathrm{MAE} = \sum_{s=0}^{m-k} \frac{|F(N+s+k) - X(N+s+k)|}{m\,|X(N+s)|}, \qquad \mathrm{RMSE} = \sqrt{\sum_{s=0}^{m-k} \frac{[F(N+s+k) - X(N+s+k)]^2}{m\,X(N+s)^2}}.$$
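
The two forecast criteria can be computed as in the following sketch. It assumes one reading of the definitions, namely that each forecast error is scaled by the actual value at the forecast origin, X(N+s), and by the number of forecasts m. The function and argument names are hypothetical.

```python
import math


def relative_mae(forecasts, actuals, bases, m):
    """Sum of |F - X| / (m * |base|), where base is the value at the forecast origin."""
    return sum(abs(f - x) / (m * abs(b))
               for f, x, b in zip(forecasts, actuals, bases))


def relative_rmse(forecasts, actuals, bases, m):
    """Square root of the sum of (F - X)^2 / (m * base^2)."""
    return math.sqrt(sum((f - x) ** 2 / (m * b * b)
                         for f, x, b in zip(forecasts, actuals, bases)))


# Toy values: two forecasts, each off by 10 percent of its base value.
toy_mae = relative_mae([11.0, 18.0], [10.0, 20.0], [10.0, 20.0], 2)
toy_rmse = relative_rmse([11.0, 18.0], [10.0, 20.0], [10.0, 20.0], 2)
```

With this scaling both criteria are unit-free, so they can be compared across forecast horizons k.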



The comparative results for the forecasts of all the combinations of models and methods are
summarized in Table 4. The forecast errors of the SCAD PCMLE are consistently smaller than
those of the MLE for all three models considered. The forecasting performance of the
associated AR(24) model chosen by FPE is also shown for the purpose of comparison. All
three AR models fitted by the SCAD PCMLE, AR(20), AR(25) and AR(30), are sparse, whereas
the AR models fitted by the regular MLE and the AR(24) model selected by FPE do not have any zero
estimates at all. The coefficient estimates from all the models and methods are listed in Table 5.
As seen from this table, lag 24 is very significant. This is why the FPE chooses 24 as the best
order. However, choosing such a long-order AR model comes at the cost of poorer
prediction accuracy, as can be seen from Table 4. The SCAD PCMLE picks up only 6 significant
lags: 2, 3, 9, 18, 23, 24, which has helped improve the prediction accuracy. Also, the
forecasting errors of the SCAD PCMLE are quite stable, except that for p = 20 the forecasting errors
are relatively higher, because one significant lag, lag 24, is excluded from the model.

[Figure 2: four panels (Lag 2, Lag 3, Lag 4 and Lag 5 coefficients in the AR(5) model), each plotted against sample size.]

Fig 2. The probability of zero estimates for the four coefficients in model (4.2).



6. Conclusion 

In this paper, we propose a new sub-model selection procedure for AR(p) models based on
penalized conditional maximum likelihood estimators of the coefficients. We prove that the resulting sparse
PCMLE for the coefficient profile is both strongly consistent and locally $N^{-1/2}$-consistent under
mild conditions. More importantly, under slightly stronger conditions, we establish the oracle
properties for the sparse estimator, analogous to those in [10] for independent observations.
They state that the zero coefficients are estimated to be exactly zero with probability going to one,
and that the non-zero coefficients are estimated as efficiently as if the true sub-model
were known in advance. This property, together with the overall consistency, guarantees that the
optimal sub-model is selected with probability tending to one, and that the estimation efficiency for
the selected coefficients is improved by reducing the full model to the sub-model. Most
importantly, all of this is done by running the model once, which saves a great deal of
computational cost compared with traditional sub-model selection methods.

Although the asymptotic theorems look ideal, finite sample performances could be very
different and even misleading. In order to give more guidance for practical use of our method,
we provide a thorough discussion of the finite sample properties. We suggest obtaining some
preliminary information on the magnitude of the non-zero estimates and designing the sample size
accordingly before running the PCMLE. In this way, satisfactory results can be achieved with
possibly the smallest number of observations.

Table 3

df   Lag     MLE       LASSO     SCAD      True
2    1       0.1995    0.1978    0.1435    0.2
     2       -0.0029   0         0         0
     3       0.1801    0.1799    0.1642    0.2
     4       0.0025    0.0018    0         0
     5       0.1726    0.1713    0.1841    0.2
     error   0.0341    0.0351    0.0687
5    1       0.1831    0.1806    0.1825    0.2
     2       -0.0048   0         0         0
     3       0.2161    0.2151    0.2159    0.2
     4       0.027     0.0234    0.0203    0
     5       0.1987    0.1978    0.2       0.2
     error   0.036     0.034     0.0311


7. Proofs 

7.1. Proof of Theorem 1 

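We prove by contradiction. See a similar method to show strong consistency in [18]. If $\hat\theta_{\lambda_N}$
does not converge to $\theta_0$ almost surely, there exists an $\eta > 0$ such that the set $F = \{\omega :
\limsup_{N\to\infty} \|\hat\theta_{\lambda_N}(\omega) - \theta_0\| > \eta\}$ has a positive probability. Since $A := \Theta \cap \{\theta : \|\theta - \theta_0\| \ge \eta\}$
is compact, for every $\omega \in F$, there exists a convergent subsequence $\{\hat\theta_{\lambda_{N_i}}(\omega)\}$ such that

$$\hat\theta_{\lambda_{N_i}}(\omega) \to \theta \in A.$$

It follows that

$$\begin{aligned}
&\limsup \frac{1}{N_i} \Big( \sum_{t=p+1}^{N_i} l_t(\theta_0) - N_i P_{\lambda_{N_i}}(\theta_0) \Big) && (7.1) \\
&\quad \le \limsup \sup_{\theta} \frac{1}{N_i} \Big( \sum_{t=p+1}^{N_i} l_t(\theta) - N_i P_{\lambda_{N_i}}(\theta) \Big) \\
&\quad = \limsup \frac{1}{N_i} \Big( \sum_{t=p+1}^{N_i} l_t(\hat\theta_{\lambda_{N_i}}(\omega)) - N_i P_{\lambda_{N_i}}(\hat\theta_{\lambda_{N_i}}(\omega)) \Big) \\
&\quad = \limsup \Big( \frac{1}{N_i} \sum_{t=p+1}^{N_i} l_t(\hat\theta_{\lambda_{N_i}}(\omega)) - P_{\lambda_{N_i}}(\hat\theta_{\lambda_{N_i}}(\omega)) + P_{\lambda_{N_i}}(\theta) - P_{\lambda_{N_i}}(\theta) \Big) \\
&\quad = \limsup \frac{1}{N_i} \sum_{t=p+1}^{N_i} l_t(\hat\theta_{\lambda_{N_i}}(\omega)) && (7.2) \\
&\quad \le \limsup \sup_{\theta \in A} \frac{1}{N_i} \sum_{t=p+1}^{N_i} l_t(\theta) \le E \sup_{\theta \in A} l_t(\theta). && (7.3)
\end{aligned}$$
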
Table 4

A comparison of forecasting performances of 3 methods: MLE, SCAD PCMLE, and MLE with an optimal
order chosen by FPE.

                 Forecast Evaluation
Model            1-step   6-step   12-step   1-step   6-step   12-step
                 MAE      MAE      MAE       RMSE     RMSE     RMSE
p=30   MLE       0.0506   0.0497   0.0478    0.3921   0.385    0.3704
       SCAD      0.0414   0.0406   0.0392    0.3207   0.3145   0.3035
p=25   MLE       0.0454   0.0445   0.0428    0.3515   0.3448   0.3319
       SCAD      0.0414   0.0406   0.0392    0.321    0.3148   0.3038
p=20   MLE       0.0477   0.0469   0.0453    0.3694   0.363    0.3507
       SCAD      0.046    0.0452   0.0435    0.3563   0.35     0.3372
p=24   FPE       0.0461   0.0452   0.0435    0.3571   0.3503   0.3371


(7.2) is from the conditions on the penalty function. We have (7.3) from Lemma 1 since the first
part of Assumptions 1 implies $E \log^+ |Z| < \infty$. On the other hand,

$$(7.1) = \lim \frac{1}{N_i} \sum_{t=p+1}^{N_i} l_t(\theta_0) - \lim P_{\lambda_{N_i}}(\theta_0) = E\, l_t(\theta_0) \qquad (7.4)$$

by the condition on the penalty function and Lemma 2, part 1. Therefore, $E\, l_t(\theta_0) \le E \sup_{\theta \in A} l_t(\theta)$
with a positive probability. But $\sup_{\theta \in A} l_t(\theta) = l_t(\theta_A)$ for some $\theta_A \in A$ by the continuity of
$l_t(\cdot)$. This is a contradiction with Lemma 2, part 2 since $\|\theta_A - \theta_0\| \ge \eta > 0$.



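7.2. Proof of Proposition 1

We follow the pattern of the proof of Lemma 1 in [10]. However, in our case, the estimation of
the orders is completely different from theirs: we consider a dependent case, while theirs
is for i.i.d. random variables.

To show $Q((0^T, \theta_{1,1}^T)^T) = \max_{\|\theta_{1,0}\| \le C N^{-1/2}} Q(\theta)$, it is sufficient to have

$$\frac{\partial Q(\theta)}{\partial \phi_j} < 0 \ \text{ for } 0 < \phi_j < C N^{-1/2} \quad \text{and} \quad \frac{\partial Q(\theta)}{\partial \phi_j} > 0 \ \text{ for } -C N^{-1/2} < \phi_j < 0 \qquad (7.5)$$

for $1 \le j \le s$. By Taylor's expansion,

$$\begin{aligned}
\frac{\partial Q(\theta)}{\partial \phi_j} &= \frac{\partial L(\theta)}{\partial \phi_j} - N p'_{\lambda_N}(|\phi_j|)\,\mathrm{sgn}(\phi_j) \\
&= \frac{\partial L(\theta_0)}{\partial \phi_j} + \sum_{l=1}^{p} \frac{\partial^2 L(\theta_0)}{\partial \phi_j \partial \phi_l} (\phi_l - \phi_{l,0}) + \frac{1}{2} \sum_{l=1}^{p} \sum_{k=1}^{p} \frac{\partial^3 L(\theta^*)}{\partial \phi_j \partial \phi_l \partial \phi_k} (\phi_l - \phi_{l,0})(\phi_k - \phi_{k,0}) - N p'_{\lambda_N}(|\phi_j|)\,\mathrm{sgn}(\phi_j). \qquad (7.6)
\end{aligned}$$
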
Table 5

Estimated values of the coefficients for all the combinations of models and methods.

Estimated Coefficients
       p=30                p=25                p=20                p=24
Lag    MLE       SCAD      MLE       SCAD      MLE       SCAD      FPE
1      -0.0486   0         -0.0458   0         -0.0549   0         -0.0431
2      0.0955    0.1031    0.0928    0.1054    0.0867    0.0966    0.0923
3      0.083     0.0819    0.0864    0.084     0.0948    0.0867    0.0864
4      0.0519    0         0.0465    0         0.0461    0         0.0475
5      -0.0212   0         -0.0236   0         -0.0225   0         -0.0239
6      0.0594    0         0.0636    0         0.0524    0.0639    0.0627
7      -0.0146   0         -0.0134   0         -0.0178   0         -0.015
8      -0.0021   0         0.0028    0         0.0081    0         0.003
9      0.0903    0.0969    0.0935    0.0964    0.092     0.0935    0.0944
10     0.0499    0         0.0535    0         0.0546    0.058     0.0535
11     -0.0031   0         -0.0009   0         0.0004    0         -0.0012
12     0.0432    0         0.0409    0         0.0318    0         0.0409
13     0.0087    0         0.0065    0         0.009     0         0.0056
14     -0.0124   0         -0.0115   0         -0.0109   0         -0.0116
15     0.0039    0         0.0034    0         -0.0114   0         0.0028
16     -0.0446   0         -0.0405   0         -0.0406   0         -0.042
17     -0.0093   0         -0.0025   0         0.0002    0         -0.0021
18     0.0827    0.0936    0.0865    0.0931    0.0736    0.0715    0.0867
19     0.0359    0         0.042     0         0.046     0         0.0409
20     0.0369    0         0.0429    0         0.0461    0         0.0435
21     -0.0416   0         -0.0425   0                             -0.0433
22     0.0175    0         0.023     0                             0.0217
23     0.0708    0.0768    0.0751    0.0768                        0.0736
24     -0.1508   -0.1373   -0.1439   -0.1363                       -0.1435
25     -0.0219   0         -0.0169   0
26     0.0368    0
27     0.0063    0
28     0.0519    0
29     0.0313    0
30     0.0069    0


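Here, $\theta^*$ is between $\theta$ and $\theta_0$. By the observation (2.3), it is easy to see that

$$\frac{\partial L(\theta)}{\partial\phi_j} = \sum_{t=p+1}^{N}\frac{\partial l_t(\theta)}{\partial\phi_j} = \sum_{t=p+1}^{N}\frac{\partial \log g(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})}{\partial\phi_j} = -\sum_{t=p+1}^{N}\frac{g'(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})}{g(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})}\,X_{t-j},$$

$$\frac{\partial^2 L(\theta)}{\partial\phi_j\partial\phi_i} = \sum_{t=p+1}^{N}\left(\frac{g''g - (g')^2}{g^2}\right)(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})\,X_{t-j}X_{t-i},$$

$$\frac{\partial^3 L(\theta)}{\partial\phi_j\partial\phi_i\partial\phi_k} = -\sum_{t=p+1}^{N}\left(\frac{g'}{g}\right)''(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})\,X_{t-j}X_{t-i}X_{t-k}.$$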
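Therefore, the first term of (7.6) is

$$\frac{\partial L(\theta_0)}{\partial\phi_j} = -\sum_{t=p+1}^{N}\frac{g'(Z_t)}{g(Z_t)}\,X_{t-j}. \tag{7.7}$$

In the second term of (7.6),

$$\frac{\partial^2 L(\theta_0)}{\partial\phi_j\partial\phi_i} = \sum_{t=p+1}^{N}\frac{g''(Z_t)g(Z_t) - (g'(Z_t))^2}{g^2(Z_t)}\,X_{t-j}X_{t-i}.$$

First, we estimate the order of (7.7), the first term of (7.6). Since $Z_t$ is independent of $X_{t-j}$,

$$E\left\{\frac{g'(Z_t)}{g(Z_t)}\,X_{t-j}\right\}^2 = C(g)\gamma(0),$$

and for $s < t$

$$E\left\{\frac{g'(Z_s)}{g(Z_s)}X_{s-j}\,\frac{g'(Z_t)}{g(Z_t)}X_{t-j}\right\} = E\left\{\frac{g'(Z_s)}{g(Z_s)}X_{s-j}X_{t-j}\right\}E\left\{\frac{g'(Z_t)}{g(Z_t)}\right\} = 0, \tag{7.10}$$

we have

$$E\left(\frac{\partial L(\theta_0)}{\partial\phi_j}\right)^2 = (N-p)\,C(g)\,\gamma(0).$$

Therefore $\left\|\partial L(\theta_0)/\partial\phi_j\right\| = O(N^{1/2})$. By Chebyshev's inequality, we have

$$\frac{\partial L(\theta_0)}{\partial\phi_j} = O_p(N^{1/2}). \tag{7.11}$$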

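Next we estimate the order of the second term of (7.6). We have

$$E\,\frac{g''(Z_t)g(Z_t) - (g'(Z_t))^2}{g^2(Z_t)} = \int g''(z)\,dz - C(g) = -C(g)$$

by the assumptions on $g$. Denote $Y_t = \left\{\dfrac{g''(Z_t)g(Z_t) - (g'(Z_t))^2}{g^2(Z_t)} + C(g)\right\}X_{t-j}X_{t-i}$. Then $EY_t = 0$, $EY_tY_s = 0$ for $s \ne t$ and

$$EY_t^2 = E\left\{\frac{g''(Z_t)g(Z_t) - (g'(Z_t))^2}{g^2(Z_t)} + C(g)\right\}^2 E\,X_{t-j}^2X_{t-i}^2$$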

is finite by the assumptions on $g$ and Lemma 4. Notice that the cross term obtained on expanding the square is finite by the Cauchy-Schwarz inequality. Then by the weak law of large numbers (Theorem 8.3.2, [9]), $\sum_{t=p+1}^{N} Y_t/N$ converges to 0 in probability. Besides, $\sum_{t=p+1}^{N} X_{t-j}X_{t-i}/N$ converges to $\gamma(j-i)$ in probability by the ergodicity of (2.1); see the proof of Lemma 1 for the ergodicity. Therefore,

$$\frac{1}{N}\frac{\partial^2 L(\theta_0)}{\partial\phi_j\partial\phi_i} \ \text{ converges to } \ -C(g)\gamma(j-i) \ \text{ in probability.} \tag{7.12}$$

Hence, the second term of (7.6) has order $O_p(N^{1/2})$. Besides, $\dfrac{\partial^3 L(\theta^*)}{\partial\phi_j\partial\phi_i\partial\phi_k}(\phi_i - \phi_{i,0})(\phi_k - \phi_{k,0}) = o_p(N^{1/2})$ by Lemma 3. Therefore, the first three terms of (7.6) have order $O_p(N^{1/2})$ by the condition on $\theta$. By the conditions on $\lambda_N$ and $p'_{\lambda_N}(\phi_j)$, the last term $N p'_{\lambda_N}(|\phi_j|)\operatorname{sgn}(\phi_j)$ dominates the other three terms in (7.6). Hence (7.5) is established. This completes the proof.
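The score formula used in this proof, $\partial L/\partial\phi_j = -\sum_t (g'/g)(\cdot)\,X_{t-j}$, can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a Student-t innovation density with 3 degrees of freedom as one example of a heavy-tailed $g$, and the data vector `x`, the coefficients `phi`, and all function names are hypothetical. It compares the analytic score with a central finite difference of the conditional log-likelihood.

```python
import math

def t_logpdf(z, nu=3.0):
    """Log-density of a Student-t variable with nu degrees of freedom (assumed g)."""
    c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return c - (nu + 1) / 2 * math.log1p(z * z / nu)

def t_score(z, nu=3.0):
    """g'(z)/g(z) for the Student-t density."""
    return -(nu + 1) * z / (nu + z * z)

def residual(x, phi, t):
    """X_t - phi_1 X_{t-1} - ... - phi_p X_{t-p} with 0-based indexing."""
    return x[t] - sum(phi[j] * x[t - 1 - j] for j in range(len(phi)))

def cond_loglik(x, phi):
    """Conditional log-likelihood L(theta): sum over t = p+1..N of log g(residual)."""
    return sum(t_logpdf(residual(x, phi, t)) for t in range(len(phi), len(x)))

def cond_score(x, phi):
    """Analytic gradient: dL/dphi_j = -sum_t (g'/g)(residual_t) * X_{t-j}."""
    p = len(phi)
    grad = [0.0] * p
    for t in range(p, len(x)):
        s = t_score(residual(x, phi, t))
        for j in range(p):
            grad[j] -= s * x[t - 1 - j]
    return grad

# Hypothetical toy data and coefficients, used only to exercise the formulas.
x = [0.3, -1.2, 0.8, 2.5, -0.4, 0.1, -3.0, 1.7, 0.6, -0.9]
phi = [0.4, -0.2]

analytic = cond_score(x, phi)
h = 1e-6
numeric = []
for j in range(len(phi)):
    up = list(phi)
    dn = list(phi)
    up[j] += h
    dn[j] -= h
    numeric.append((cond_loglik(x, up) - cond_loglik(x, dn)) / (2 * h))
```

The two gradients should agree up to finite-difference error, which supports the sign convention of the first-derivative formula.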

7.3. Proof of Proposition 2

This proposition is a version of Theorem 1 in [10] for dependent random variables. We sketch the proof for completeness. Let $b_N = N^{-1/2} + a_N$ and $u = (u_1, \dots, u_p)^T$. To show the existence of a local maximizer $\hat\theta$ with $\|\hat\theta - \theta_0\| = O_p(b_N)$, it is sufficient to show that, for any $\eta > 0$, $P\left(\sup_{\|u\| = C} Q(\theta_0 + b_Nu) < Q(\theta_0)\right) \ge 1 - \eta$ for some large constant $C$. In our case,

$$\begin{aligned}
D_N(u) &:= Q(\theta_0 + b_Nu) - Q(\theta_0) \\
&\le L(\theta_0 + b_Nu) - L(\theta_0) - N\sum_{j=s+1}^{p}\left\{p_{\lambda_N}(|\phi_{j,0} + b_Nu_j|) - p_{\lambda_N}(|\phi_{j,0}|)\right\} \\
&\le b_NL'(\theta_0)^Tu + \frac{1}{2}u^TH(\theta_0)u\,b_N^2 + \frac{1}{6}\sum_{i=1}^{p}\sum_{j=1}^{p}\sum_{k=1}^{p}\frac{\partial^3 L(\theta^*)}{\partial\phi_i\partial\phi_j\partial\phi_k}u_iu_ju_k\,b_N^3 \\
&\quad - \sum_{j=s+1}^{p} N b_N\,p'_{\lambda_N}(|\phi_{j,0}|)\operatorname{sgn}(\phi_{j,0})\,u_j - \sum_{j=s+1}^{p}\frac{N}{2}\,b_N^2\,p''_{\lambda_N}(|\phi_{j,0}|)\,u_j^2\,\{1 + o(1)\}.
\end{aligned}$$

Here the gradient is $L'(\theta_0) = \left(-\sum_{t=p+1}^{N}\frac{g'(Z_t)}{g(Z_t)}X_{t-1}, \dots, -\sum_{t=p+1}^{N}\frac{g'(Z_t)}{g(Z_t)}X_{t-p}\right)^T$, and the matrix is $H(\theta_0) = (a_{ij})_{p\times p}$ with $a_{ij} = \sum_{t=p+1}^{N}\frac{g''(Z_t)g(Z_t) - (g'(Z_t))^2}{g^2(Z_t)}X_{t-i}X_{t-j}$. By (7.11), we have $L'(\theta_0) = O_p(N^{1/2})$. Then the first term has order $O_p(N^{1/2}b_N) = O_p(Nb_N^2)$. Notice that $a_{ij} = O_p(N)$ by (7.12). Therefore, the second term has order $O_p(Nb_N^2)$. The third term has order $O_p(Nb_N^3) = O_p(Nb_N^2)$ by Lemma 3 and the condition on $a_N$ in Assumptions 3. Recall that $b_N = N^{-1/2} + a_N$. It is obvious that the fourth term has order $O_p(Nb_N^2)$ and the fifth term has order $o_p(Nb_N^2)$. Then the second term dominates the others by choosing a sufficiently large $C$. Let $\Sigma$ be the non-negative definite $p \times p$ matrix with the entry $\gamma(j-i)$ at row $j$ and column $i$, $1 \le i, j \le p$. Again by (7.12), $H(\theta_0)/N$ converges to $-C(g)\Sigma$ with $C(g) > 0$. Therefore the second term is non-positive with probability tending to 1. This finishes the proof of the proposition.
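The rate $b_N = N^{-1/2} + a_N$ depends on the penalty only through its derivative at the nonzero true coefficients. The sketch below illustrates this for the two penalties compared in the paper. It uses the standard SCAD form with the conventional shape parameter $a = 3.7$, and it assumes that $a_N$ denotes the largest value of $p'_{\lambda_N}(|\phi_{j,0}|)$ over the nonzero true coefficients; that definition lies outside this section, so it is an assumption here. The coefficient values and function names are hypothetical.

```python
def scad_penalty(theta, lam, a=3.7):
    """Standard SCAD penalty p_lambda(|theta|) with shape parameter a > 2."""
    t = abs(theta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(|theta|): lam on [0, lam], then linear decay to 0 at a*lam."""
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)

def lasso_derivative(theta, lam):
    """Derivative of the LASSO penalty lam*|theta| away from zero: constant lam."""
    return lam

# Hypothetical nonzero true coefficients and tuning parameter.
nonzero_true = [0.6, -0.4]
lam = 0.1

# Assumed definition: a_N = max over nonzero true coefficients of p'(|phi_{j,0}|).
a_n_scad = max(scad_derivative(c, lam) for c in nonzero_true)
a_n_lasso = max(lasso_derivative(c, lam) for c in nonzero_true)
```

Under this assumed definition, the SCAD value of $a_N$ is exactly zero once every nonzero coefficient exceeds $a\lambda_N$ in absolute value, so $b_N$ reduces to $N^{-1/2}$. The LASSO value stays at $\lambda_N$.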
7.4. Proof of Theorem 2

We only need to show the second part. Let $\hat\theta = (0^T, \hat\theta_1^T)^T$ be the $N^{-1/2}$-consistent local maximizer of $Q(\theta)$ with $\partial Q(\hat\theta)/\partial\phi_j = 0$ for $j = s+1, \dots, p$. Then we have

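$$\begin{aligned}
\frac{\partial Q(\hat\theta)}{\partial\phi_j} &= \frac{\partial L(\hat\theta)}{\partial\phi_j} - N p'_{\lambda_N}(|\hat\phi_j|)\operatorname{sgn}(\hat\phi_j) \\
&= \frac{\partial L(\theta_0)}{\partial\phi_j} + \sum_{i=s+1}^{p}\left\{\frac{\partial^2 L(\theta_0)}{\partial\phi_j\partial\phi_i} + o_p(N)\right\}(\hat\phi_i - \phi_{i,0}) \\
&\quad - N\left[p'_{\lambda_N}(|\phi_{j,0}|)\operatorname{sgn}(\phi_{j,0}) + \left\{p''_{\lambda_N}(|\phi_{j,0}|) + o_p(1)\right\}(\hat\phi_j - \phi_{j,0})\right] = 0.
\end{aligned} \tag{7.13}$$

Here the $o_p(N)$ in the second term is from Lemma 3 and the $N^{-1/2}$ consistency of $\hat\theta$. The $o_p(1)$ in the last term is from the property of $p''_{\lambda_N}$. Let

$$M'(\theta_{0,1}) = \left(\frac{\partial L(\theta_0)}{\partial\phi_{s+1}}, \dots, \frac{\partial L(\theta_0)}{\partial\phi_p}\right)^T$$

be the gradient vector and $M''(\theta_{0,1})$ be the Hessian matrix of second partial derivatives of $L(\theta)$ at $\theta_{0,1}$. Then the matrix form of (7.13) is

$$M'(\theta_{0,1}) + M''(\theta_{0,1})(\hat\theta_1 - \theta_{0,1}) - Nb - N(\Lambda + o_p(1))(\hat\theta_1 - \theta_{0,1}) = 0.$$

Dividing by $\sqrt{N}$, together with some algebra, we have

$$\sqrt{N}\,b + \sqrt{N}\,\left(C(g)\Gamma + \Lambda + o_p(1)\right)(\hat\theta_1 - \theta_{0,1}) - \frac{1}{\sqrt{N}}M'(\theta_{0,1}) = 0.$$

By (7.12),

$$\frac{1}{N}M''(\theta_{0,1}) + C(g)\Gamma \to 0 \ \text{ in probability.}$$

Therefore, to have the second part of Theorem 2, we only need to prove that $\frac{1}{\sqrt{N}}M'(\theta_{0,1})$ converges in distribution to $N(0, C(g)\Gamma)$. By the Cramér-Wold device, it is enough to have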
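$$\frac{1}{\sqrt{N}}\sum_{j=s+1}^{p}\lambda_j\frac{\partial L(\theta_0)}{\partial\phi_j} \to N\left(0,\ C(g)\sum_{i=s+1}^{p}\sum_{j=s+1}^{p}\lambda_i\lambda_j\gamma(i-j)\right) \ \text{ in distribution} \tag{7.14}$$

for any vector $\lambda = (\lambda_{s+1}, \dots, \lambda_p)^T$ with $\|\lambda\| \ne 0$. Let $\mathcal{F}_t$ be the $\sigma$-algebra $\sigma(X_1, \dots, X_t)$ and $E_t(\cdot) := E(\cdot \mid \mathcal{F}_t)$ be the conditional expectation. Then

$$\sum_{j=s+1}^{p}\lambda_j\frac{\partial L(\theta_0)}{\partial\phi_j} = -\sum_{t=p+1}^{N}\frac{g'(Z_t)}{g(Z_t)}\sum_{j=s+1}^{p}\lambda_jX_{t-j}$$

is a martingale since

$$E_{t-1}\left\{\frac{g'(Z_t)}{g(Z_t)}\sum_{j=s+1}^{p}\lambda_jX_{t-j}\right\} = \left(\sum_{j=s+1}^{p}\lambda_jX_{t-j}\right)E\left\{\frac{g'(Z_t)}{g(Z_t)}\right\} = 0.$$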
Now we can use the Lindeberg condition given in [8] for the martingale central limit theorem. First, we verify condition (1) on page 60 of that paper. Let



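$$V_N^2 := \sum_{t=p+1}^{N}E_{t-1}\left\{\frac{g'(Z_t)}{g(Z_t)}\sum_{j=s+1}^{p}\lambda_jX_{t-j}\right\}^2 = C(g)\sum_{t=p+1}^{N}\left(\sum_{j=s+1}^{p}\lambda_jX_{t-j}\right)^2$$

and

$$\begin{aligned}
s_N^2 := EV_N^2 &= C(g)\,E\sum_{t=p+1}^{N}\left(\sum_{j=s+1}^{p}\lambda_jX_{t-j}\right)^2 \\
&= (N-p)\,C(g)\sum_{s+1 \le i,j \le p}\lambda_i\lambda_j\gamma(i-j) = O(N).
\end{aligned} \tag{7.15}$$

Then

$$\frac{V_N^2}{s_N^2} = \frac{\sum_{t=p+1}^{N}\left(\sum_{j=s+1}^{p}\lambda_jX_{t-j}\right)^2}{E\sum_{t=p+1}^{N}\left(\sum_{j=s+1}^{p}\lambda_jX_{t-j}\right)^2} = \frac{\sum_{j=s+1}^{p}\lambda_j^2\sum_{t=p+1}^{N}X_{t-j}^2 + 2\sum_{j=s+1}^{p}\sum_{k=j+1}^{p}\lambda_j\lambda_k\sum_{t=p+1}^{N}X_{t-j}X_{t-k}}{E\left\{\sum_{j=s+1}^{p}\lambda_j^2\sum_{t=p+1}^{N}X_{t-j}^2 + 2\sum_{j=s+1}^{p}\sum_{k=j+1}^{p}\lambda_j\lambda_k\sum_{t=p+1}^{N}X_{t-j}X_{t-k}\right\}}.$$

To show $V_N^2/s_N^2 \to 1$ in probability, it is sufficient to have

$$\frac{\sum_{t=p+1}^{N}X_{t-j}^2}{E\sum_{t=p+1}^{N}X_{t-j}^2} \to 1 \ \text{ in probability} \tag{7.16}$$

and

$$\frac{\sum_{t=p+1}^{N}X_{t-j}X_{t-k}}{E\sum_{t=p+1}^{N}X_{t-j}X_{t-k}} \to 1 \ \text{ in probability} \tag{7.17}$$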

for any $s+1 \le j < k \le p$. (7.16) and (7.17) are true from the ergodicity of (2.1); see the proof of Lemma 1 for the ergodicity. Now we verify the Lindeberg condition as in [8]. Let $P_t(\cdot) = P(\cdot \mid \mathcal{F}_t)$ be the conditional probability. For any $\varepsilon > 0$,



9'iZt) / 



g'iZt) ^ 



< 



< 



giZt) ^ ' 



1/2 



1/2 



9{Zt) 



1/2 



,2„2 
t 6^ 



1/2 



E 



{9'{Zt)\ 
\9iZt)) 



1/2 



\9{Zt)) 



1/2 



By a similar argument as in Lemma 3, J2j=p+i {J2^=s+i ^j^t~jj ~ op{N^/'^). Therefore, 



j=p+i 

Op(l). 



j=s+l 



j=s+l 



J 



The Lindeberg condition is satisfied. Finally, the variance in the central limit theorem (7.14) is 
from the calculation (7.15). This finishes the proof. 
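The variance calculation (7.15) can be illustrated by simulation in the simplest case. The sketch below is only a sanity check under stated assumptions, not part of the authors' study: it takes an AR(1) model with standard normal innovations, for which $g'(z)/g(z) = -z$, and it assumes $C(g) = E\{g'(Z)/g(Z)\}^2$, which equals 1 here. The normalized score $N^{-1/2}\sum_t Z_tX_{t-1}$ should then have variance close to $C(g)\gamma(0) = 1/(1-\phi^2)$. The seed, sample sizes, and function name are arbitrary choices.

```python
import math
import random

def normalized_score(n, phi, rng, burn=200):
    """N^{-1/2} * sum_t -(g'/g)(Z_t) X_{t-1} for a Gaussian AR(1) path of length n."""
    x_prev = 0.0
    for _ in range(burn):            # burn-in so the path is close to stationarity
        x_prev = phi * x_prev + rng.gauss(0.0, 1.0)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        total += z * x_prev          # -(g'/g)(z) = z for the standard normal density
        x_prev = phi * x_prev + z
    return total / math.sqrt(n)

rng = random.Random(12345)
phi = 0.5
draws = [normalized_score(400, phi, rng) for _ in range(500)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
target = 1.0 / (1.0 - phi ** 2)      # C(g) * gamma(0) with C(g) = 1
```

With these settings the empirical variance `var` should land near `target`, up to Monte Carlo error of a few percent.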



8. Appendix 

Lemma 1. Assume the AR model (2.1) is causal, $E\log^+|Z| := E\{\max(0, \log|Z|)\} < \infty$ and $g(z)$ is continuous. Let $A$ be a compact subset of the parameter space $\Theta$. Then $\{\sup_{\theta\in A} l_t(\theta)\}$ is strictly stationary, ergodic and

$$\limsup_{N\to\infty}\sup_{\theta\in A}\frac{1}{N}\sum_{t=p+1}^{N} l_t(\theta) \le E\sup_{\theta\in A} l_t(\theta) \quad \text{a.s.}$$

Proof. Let $X := (X_t, X_{t-1}, \dots, X_{t-p})^T$ be the vector of $p+1$ random variables. Recall (2.3). To emphasize the dependence of $l_t(\theta)$ on $X$, denote $l_t(X, \theta) := l_t(\theta) = \log g\left(X_t - \sum_{j=1}^{p}\phi_jX_{t-j}\right)$. Then $l_t(X, \theta)$ is continuous by the continuity of $g(z)$. We claim that $\sup_{\theta\in\Pi} l_t(X, \theta)$ is continuous with respect to $X$, for any compact subset $\Pi \subset A$. Assume by contradiction that $\sup_{\theta\in\Pi} l_t(X, \theta)$ is not continuous at $X_{(0)}$. Then there exists an $\varepsilon > 0$ such that for all $\delta > 0$, there exists an $X_{(1)}$ with

$$\|X_{(1)} - X_{(0)}\| < \delta \quad \text{and} \quad \left|\sup_{\theta\in\Pi} l_t(X_{(1)}, \theta) - \sup_{\theta\in\Pi} l_t(X_{(0)}, \theta)\right| > \varepsilon.$$

By the continuity of $l_t(X, \theta)$ with respect to $\theta = (\phi_1, \dots, \phi_p)^T$, $\sup_{\theta\in\Pi} l_t(X, \theta)$ is attained in $\Pi$ for each $X$ and each compact subset $\Pi$ of $A$. Denote

$$l_t(X_{(0)}, \theta_{(0)}) = \sup_{\theta\in\Pi} l_t(X_{(0)}, \theta) \quad \text{and} \quad l_t(X_{(1)}, \theta_{(1)}) = \sup_{\theta\in\Pi} l_t(X_{(1)}, \theta).$$

Without loss of generality, assume $l_t(X_{(1)}, \theta_{(1)}) > l_t(X_{(0)}, \theta_{(0)})$. Then

$$l_t(X_{(1)}, \theta_{(1)}) - l_t(X_{(0)}, \theta_{(1)}) \ge l_t(X_{(1)}, \theta_{(1)}) - l_t(X_{(0)}, \theta_{(0)}) > \varepsilon.$$

This is a contradiction with the continuity of $l_t(X, \theta)$ with respect to $X$. Hence $\sup_{\theta\in\Pi} l_t(X, \theta)$ is continuous with respect to $X$, for any compact subset $\Pi \subset A$. Consequently $\sup_{\theta\in\Pi} l_t(X, \theta)$ is $\mathcal{B}$-measurable, where $\mathcal{B}$ is the Borel $\sigma$-algebra on $\mathbb{R}^{p+1}$. This verifies the second condition of Theorem 3.10 in [23]. The other two conditions are obvious by the continuity of $l_t(X, \theta)$. Besides, $A$ is a compact set in $\Theta$. Therefore, by Theorem 3.10 of [23], there exists a $\mathcal{B}$-measurable function $\varphi(X) = (\varphi_1(X), \dots, \varphi_p(X))^T : \mathbb{R}^{p+1} \to A$ such that

$$\sup_{\theta\in A} l_t(X, \theta) = \log g\left(X_t - \sum_{j=1}^{p}\varphi_j(X)X_{t-j}\right). \tag{8.1}$$

Since the AR(p) model (2.1) is causal and $E\log^+|Z| < \infty$, (2.1) is strictly stationary by Theorem 1 in [7]. On the other hand, (2.1) is ergodic: the fact that $Z_t$ has a continuous density function $g(z)$ implies that its law is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}$. Therefore, (2.1) is strong mixing ([22]) and hence ergodic (problem 24.3, [4]). By (8.1) and the continuity of $g(z)$, the time series $\{\sup_{\theta\in A} l_t(\theta)\}$ is strictly stationary and ergodic (Theorem 36.4, [4]). Therefore,

$$\limsup_{N\to\infty}\sup_{\theta\in A}\frac{1}{N}\sum_{t=p+1}^{N} l_t(\theta) \le \limsup_{N\to\infty}\frac{1}{N}\sum_{t=p+1}^{N}\sup_{\theta\in A} l_t(\theta) = E\sup_{\theta\in A} l_t(\theta) \quad \text{a.s.}$$

□

Lemma 2. Assume $Z, Z_t$ are i.i.d. and $E|\log g(Z)| < \infty$. Then we have

1. $\lim_{N\to\infty}\frac{1}{N}\sum_{t=p+1}^{N} l_t(\theta_0) = E\log g(Z) = E\,l_t(\theta_0)$;

2. $E\,l_t(\theta_0) \ge E\,l_t(\theta)$ with equality if and only if $\theta = \theta_0$.

Proof. 1. Recall $l_t(\theta_0) = \log g(Z_t)$ from (2.4). Then $|E\,l_t(\theta_0)| = |E\log g(Z)| < \infty$. By the law of large numbers,

$$\lim_{N\to\infty}\frac{1}{N}\sum_{t=p+1}^{N} l_t(\theta_0) = \lim_{N\to\infty}\frac{1}{N}\sum_{t=p+1}^{N}\log g(Z_t) = E\log g(Z) = E\,l_t(\theta_0) \quad \text{a.s.}$$

2. First, we show $E\log g(Z + C) \le E\log g(Z)$ for any constant $C$, where the equality holds if and only if $C = 0$. Recall that $g$ is the density function of $Z$. Obviously, the equality holds if $C = 0$. If $C \ne 0$, by the strict concavity of the logarithm function,

$$E\log g(Z + C) - E\log g(Z) = E\log\frac{g(Z + C)}{g(Z)} < \log E\frac{g(Z + C)}{g(Z)} = \log\int g(z + C)\,dz = 0. \tag{8.2}$$

(8.2) is a simplified version of Example (1.3) in [23]. We provide the proof here for completeness. Now let $X = (\phi_{1,0} - \phi_1)X_{t-1} + \cdots + (\phi_{p,0} - \phi_p)X_{t-p}$. Since $Z_t$ is independent of $X$, by (8.2), we have

$$\begin{aligned}
E\,l_t(\theta) &= E\log g(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p}) \\
&= E\log g\left(X_t - \phi_{1,0}X_{t-1} - \cdots - \phi_{p,0}X_{t-p} + (\phi_{1,0} - \phi_1)X_{t-1} + \cdots + (\phi_{p,0} - \phi_p)X_{t-p}\right) \\
&= E\log g(Z_t + X) = E\{E(\log g(Z_t + X) \mid X)\} \\
&\le E\{E\log g(Z_t)\} = E\log g(Z_t) = E\,l_t(\theta_0).
\end{aligned}$$

This completes the proof. □
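Remark 1. In the case $Z, Z_t \sim N(0, 1)$,

$$\begin{aligned}
E\,l_t(\theta) = E\log g(Z_t + X) &= -\frac{1}{2}\log 2\pi - \frac{1}{2}E(Z_t + X)^2 \\
&= -\frac{1}{2}\log 2\pi - \frac{1}{2}EZ^2 - \frac{1}{2}EX^2 \\
&= E\log g(Z) - \frac{1}{2}EX^2 = E\,l_t(\theta_0) - \frac{1}{2}EX^2.
\end{aligned}$$

Obviously, the second part of Lemma 2 is true in this case.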
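The Gaussian identity in Remark 1 can be verified numerically for a constant shift. The following sketch, with hypothetical function names and integration limits chosen for illustration, approximates $E\log g(Z + c)$ for $Z \sim N(0,1)$ by the trapezoid rule and compares it with $E\log g(Z) - c^2/2$, which is the remark's formula with $X$ replaced by the constant $c$.

```python
import math

def std_normal_pdf(z):
    """Density of N(0, 1)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_logpdf(z):
    """Log-density of N(0, 1), computed directly to avoid underflow in the tails."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def expected_log_density(shift, lo=-12.0, hi=12.0, steps=24000):
    """Trapezoid approximation of E log g(Z + shift) for Z ~ N(0, 1)."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * std_normal_pdf(z) * std_normal_logpdf(z + shift)
    return total * h

base = expected_log_density(0.0)          # E log g(Z) = -0.5*log(2*pi) - 0.5
gaps = {c: base - expected_log_density(c) for c in (0.5, 1.0, 2.0)}
# Each gap should be close to c**2 / 2, so it is strictly positive for c != 0.
```

The strictly positive gap for every nonzero shift is the inequality (8.2) in this special case.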

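Lemma 3. Assume Assumptions 1 and part 3 of Assumptions 2. Further, assume $Z$ has the first three moments. Then $\sum_{t=1}^{N}\left(\frac{g'}{g}\right)''(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p})\,X_tX_{t-i}X_{t-k} = O_p(N)$, for any given integers $i, k$.

Proof. Let $A = \max\{E|Z_1|^3,\ EZ_1^2E|Z_2|,\ (E|Z_1|)^3\}$. Under the causality condition, $X_t = \sum_{j=0}^{\infty} a_jZ_{t-j}$ for a sequence of constants $a_j$ with $\sum_{j=0}^{\infty}|a_j| < \infty$. Then

$$\begin{aligned}
E|X_tX_{t-i}X_{t-k}| &= E\left|\sum_{j=0}^{\infty}\sum_{p=0}^{\infty}\sum_{q=0}^{\infty} a_ja_pa_q\,Z_{t-j}Z_{t-i-p}Z_{t-k-q}\right| \\
&\le \sum_{j=0}^{\infty}\sum_{p=0}^{\infty}\sum_{q=0}^{\infty} E|a_ja_pa_q\,Z_{t-j}Z_{t-i-p}Z_{t-k-q}| \\
&\le A\sum_{j=0}^{\infty}\sum_{p=0}^{\infty}\sum_{q=0}^{\infty}|a_ja_pa_q| = A\left(\sum_{j=0}^{\infty}|a_j|\right)^3 < \infty.
\end{aligned} \tag{8.3}$$

Therefore,

$$\begin{aligned}
E\left|\sum_{t=1}^{N}\left(\frac{g'}{g}\right)''X_tX_{t-i}X_{t-k}\right| &\le \sum_{t=1}^{N}E\left|\left(\frac{g'}{g}\right)''X_tX_{t-i}X_{t-k}\right| \\
&\le B\sum_{t=1}^{N}E|X_tX_{t-i}X_{t-k}| = O(N).
\end{aligned}$$

Then by Markov's inequality, the desired result follows. □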

Lemma 4. Assume that the innovations $\{Z_t\}$ are i.i.d. random variables. Under the causality condition, $X_t$ in the AR(p) model (2.1) has the $m$-th moment if the corresponding innovation $Z_t$ has the $m$-th moment. When the innovation has the fourth moment, $EX_t^2X_{t-k}^2 < \infty$ for any given integer $k$.

Proof. Under the causality condition, $X_t = \sum_{j=0}^{\infty} a_jZ_{t-j}$ for a sequence of constants $a_j$ with $\sum_{j=0}^{\infty}|a_j| < \infty$. Without loss of generality, we assume that $\sum_{j=0}^{\infty}|a_j| = 1$ and $a_j \ne 0$ for $j = 1, 2, \dots$. By the convexity of the function $|x|^m$,

$$\begin{aligned}
|X_t|^m &\le \left(\sum_{j=0}^{\infty}|a_j||Z_{t-j}|\right)^m \le |a_0||Z_t|^m + (1 - |a_0|)\left(\frac{\sum_{j=1}^{\infty}|a_j||Z_{t-j}|}{1 - |a_0|}\right)^m \\
&\le \cdots \le \sum_{j=0}^{\infty}|a_j||Z_{t-j}|^m.
\end{aligned}$$

Taking expectations on both sides, we obtain

$$E|X_t|^m \le E|Z|^m < \infty$$

for any positive integer $m$. Now,



W2\\Zt^2\+ET=3Wj\\Z, 



EXfXl, < E 



^ / oc 00 \ 

E + E"j^*-fc-j 



[^^ E + E 

^ / k~l 00 



Obviously, $Y_t := \sum_{j=0}^{k-1} a_jZ_{t-j} + \sum_{j=k}^{\infty}(a_j + a_{j-k})Z_{t-j}$ is also a causal linear process with innovation $\{Z_t\}$. Since $EZ_t^4 < \infty$, it follows that $EY_t^4 < \infty$. This proves the desired result. An alternative proof of this lemma can be given by using an argument similar to that in (8.3).

□ 
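Lemmas 3 and 4 both rest on the causal representation $X_t = \sum_{j\ge 0} a_jZ_{t-j}$ with absolutely summable weights. As a rough illustration, the sketch below computes the first weights of a causal AR(p) model from the standard recursion $a_0 = 1$, $a_j = \sum_{i=1}^{\min(j,p)}\phi_ia_{j-i}$, and evaluates the factor $(\sum_j|a_j|)^3$ that appears in the bound (8.3). The function name, the example coefficients, and the truncation at 200 terms are hypothetical choices made for the example.

```python
def ma_coefficients(phi, n_terms):
    """First n_terms weights a_0, a_1, ... of X_t = sum_j a_j Z_{t-j} for an AR(p) model."""
    p = len(phi)
    a = [1.0]
    for j in range(1, n_terms):
        # a_j = phi_1 a_{j-1} + ... + phi_m a_{j-m}, with m = min(j, p)
        a.append(sum(phi[i] * a[j - 1 - i] for i in range(min(j, p))))
    return a

# AR(1) with phi = 0.5: the weights are 0.5**j and their absolute sum tends to 2.
weights_ar1 = ma_coefficients([0.5], 200)
abs_sum = sum(abs(w) for w in weights_ar1)

# Truncated version of the factor (sum_j |a_j|)^3 from (8.3).
cubic_factor = abs_sum ** 3

# AR(2) example: a_1 = 0.5 and a_2 = 0.5 * 0.5 + 0.3 * 1.0 = 0.55.
weights_ar2 = ma_coefficients([0.5, 0.3], 3)
```

For a causal model the weights decay geometrically, so a finite truncation already gives the sum to high accuracy.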